We present a new empirical parameter fc, the most probable usage frequency of a word in a
language, computed from the distribution of documents over the frequency x of the word. This
parameter allows the core lexicon of a language to be separated from the content words, which
tend to be extremely frequent in some texts written in specific genres or by certain authors.
Distributions of documents over frequencies for such words display long tails for x > fc,
representing a set of documents in which such words are used in abundance. Collections of such
documents exhibit a percolation-like phase transition as the coarse grain of frequency Δf (which
flattens out the strongly irregular frequency data series) approaches the critical value fc.

PACS numbers: 89.70.+c, 05.40.-a, 05.45.Tp, 01.20.+x



Studies of lists of words arranged by frequency belong
to the most important domains of quantitative
linguistics [1]. Detecting the most frequent words, which
constitute the core lexicon of a language, is important
not only for foreign language learners but also for various
practical applications, including text compression (this
was recognized early on), speech recognition,
information retrieval, etc. It is easy to order words
by their mean frequency of use, which is typically
measured as the number of instances of a word
normalized to a sample of one million words (ipm,
instances per million words); here the notion of a word
should cover all word forms, like goes, went, gone for go.
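
As an illustration of this normalization, the following minimal Python sketch counts the instances of a lemma in a tokenized sample and converts the count to ipm. The form-to-lemma table FORMS, the function names, and the toy sentence are hypothetical and serve only the example; a real study would rely on a proper lemmatizer.

```python
# Hypothetical form-to-lemma table; a real study would use a lemmatizer.
FORMS = {"go": "go", "goes": "go", "went": "go", "gone": "go"}


def instances_per_million(count, sample_size):
    """Normalize a raw count to instances per million words (ipm)."""
    return count * 1_000_000 / sample_size


def lemma_ipm(tokens, lemma):
    """Rate of `lemma` in `tokens`, counting every word form mapped to it."""
    count = sum(1 for t in tokens if FORMS.get(t.lower(), t.lower()) == lemma)
    return instances_per_million(count, len(tokens))


# Toy sample (made up): three of the eight tokens are forms of "go".
tokens = "she goes where he went and they go".split()
print(lemma_ipm(tokens, "go"))
```

On such a short sample the rate is of course meaningless; the point is only that all word forms contribute to the count of one lemma before normalization.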

However, the mean frequency is not a sufficient
selection criterion because of the large relative dispersion
of word frequencies, which vary considerably from one text
to the next, especially in large and diverse collections
of documents. Some words (like prepositions) occur in
many texts with predictable rates, others (like pronouns
or mental verbs) are significantly more frequent for
certain writers or genres, while some are "contagious": these
words (such as proper names, technical terms,
abbreviations, etc.) appear in just a few documents, but when
they appear, they are often found in abundance. The
variability of word rates can be characterized in a variety
of ways, including Poisson K-mixtures. It can
be measured by the coefficient of variation (the standard
deviation divided by the mean). However, the
coefficient of variation is not a very useful measure of
relative dispersion when the average frequency is close to
zero, which occurs quite often for semantically loaded
words. Other ways to measure the variability of rates
for contagious words are the document frequency
(or inverse document frequency [4]), obtained by counting
the number of documents in which the given word is
mentioned, and the burstiness parameter, that is, the mean
frequency computed while ignoring documents with no
instances of the word. Evidently, none of these parameters
captures much of the heterogeneous structure of word rate
series, and none of them provides a general approach to
describing it.
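
The dispersion measures listed above can be sketched in a few lines of Python. This is a minimal illustration, assuming the rates of one word are available as a list with one value (in ipm) per document; the function name, the use of the population standard deviation, and the toy data are assumptions of the example and are not taken from the corpus.

```python
from statistics import mean, pstdev


def dispersion_measures(rates):
    """Summarize the per-document rates (ipm) of one word.

    `rates` holds one value per document; 0.0 means the word is absent.
    """
    avg = mean(rates)
    present = [r for r in rates if r > 0]
    return {
        "mean": avg,
        # Coefficient of variation: unstable when the mean is close to zero.
        "cv": pstdev(rates) / avg if avg > 0 else float("inf"),
        # Document frequency: number of documents mentioning the word.
        "df": len(present),
        # Burstiness: mean rate over the documents that contain the word.
        "burstiness": mean(present) if present else 0.0,
    }


# Toy data (made up): a "contagious" word absent from most documents.
toy = [0.0] * 8 + [900.0, 1100.0]
summary = dispersion_measures(toy)
print(summary)
```

For the toy word the mean rate is low while the burstiness is high, which is the signature of a contagious word; none of the four numbers, however, describes the shape of the distribution of documents over frequencies.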

In this paper, we approach the problem of detecting
contagious words and selecting words for the core lexicon
of a language from a probabilistic point of view.
Accordingly, the frequency x of a word (counted
as the number of its instances per million words
observed in any document of a given language) is a
positive random variable distributed with some (unknown)
probability density function p(x). Strongly irregular
frequency data series can be flattened out by introducing
the frequency coarse grain Δf. The statistics of any word
w can be characterized by the number of documents
Nw(n, Δf) = N[(n − 1)Δf < x < nΔf] for which the
rate of the word w falls into the n-th frequency
interval, for various Δf.
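
The coarse-grained document counts can be computed as in the following Python sketch. It assumes that the rates of the word are given as a list with one value per document, that documents in which the word is absent (x = 0) are skipped, and that a rate falling exactly on an interval boundary is assigned to the lower interval; these conventions and the function name are choices of the example, not prescriptions of the method.

```python
import math
from collections import Counter


def document_counts(rates, delta_f):
    """Return {n: N_w(n, delta_f)} for the per-document rates of one word.

    A document with rate x is assigned to interval n when
    (n - 1) * delta_f < x <= n * delta_f; documents with x = 0 are skipped.
    """
    counts = Counter(math.ceil(x / delta_f) for x in rates if x > 0)
    return dict(sorted(counts.items()))


toy_rates = [30, 60, 70, 90, 120, 260, 0]  # made-up rates in ipm
print(document_counts(toy_rates, 50))
```

Repeating the call with different values of delta_f gives the family of distributions over n that is analysed in what follows.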

We have found that for any coarse grain Δf the
distributions of documents Nw(n, Δf) over n are
asymmetric curves with a single maximum at fc, the most
probable frequency at which the word w would appear
in a randomly chosen document written in the given
language. The value of fc is independent of genres, authors,
and topics of documents and is an intrinsic characteristic
of the word in a contemporary language. Obviously,
fc varies as the language evolves, approaching zero as
the word becomes obsolete. For frequencies x close
to the distribution maximum fc, the distributions of
documents over n are bell shaped, but they have anomalous
tails when |x − fc| is large enough. We can estimate the
most probable frequencies fc of words independently by
two methods: first, from the distributions of documents
Nw(n, Δf) as Δf → 1 and, second, from the
distributions of authors using the same word in their texts. For
every word, both methods gave identical values of fc.
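
A crude estimate of the most probable frequency can be read off the histogram directly, as in the following Python sketch, which takes the midpoint of the most populated frequency interval. This is only an illustration under stated assumptions: the text does not specify how the maximum is located in practice, and the function name, the tie-breaking rule, and the toy rates are hypothetical.

```python
import math
from collections import Counter


def most_probable_frequency(rates, delta_f):
    """Estimate fc as the midpoint of the most populated frequency interval."""
    counts = Counter(math.ceil(x / delta_f) for x in rates if x > 0)
    if not counts:
        raise ValueError("the word does not occur in any document")
    # Ties are resolved in favour of the lowest interval (an arbitrary choice).
    n_max = max(counts, key=lambda n: (counts[n], -n))
    return (n_max - 0.5) * delta_f


# Made-up rates (ipm): most documents cluster together, two lie far in the tail.
toy_rates = [610, 640, 655, 670, 700, 1500, 3200]
f_c = most_probable_frequency(toy_rates, 100)
mean_rate = sum(toy_rates) / len(toy_rates)
print(f_c, mean_rate)
```

In the toy data the two documents in the tail pull the mean rate well above the estimated mode, while the mode itself is unaffected by them.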

The most probable frequency fc helps to detect the
content words and to select words for the core lexicon
of a language. Common words which appear uniformly
in most documents (like prepositions, conjunctions,
relational verbs, some size adjectives, etc.) usually have
relatively high mean rates f̄ (ipm), and fc < f̄. They
obviously belong to the core lexicon of a language.

The use of semantically loaded words depends
essentially on authors, genres, and topics. Although they are
absent from many documents, their mean rates f̄
are still very high because of their excessive popularity in
certain collections of texts, but their most probable rates
drop to fc ≪ f̄, indicating the presence of long tails
in the distributions of documents Nw over n for n > fc/Δf.
Finally, for contagious words found in abundance in
just a few documents, the distributions over n also have
long tails; however, their mean rates f̄ are very low since
x = 0 for almost all texts, and fc ≈ f̄ or even fc > f̄.
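
The comparison between the most probable and the mean rate can be illustrated with a few rows of Table I. The Python sketch below only computes the ratio of fc to the mean rate; the selection of rows and the function name are choices of the example, and no classification threshold is implied.

```python
# (mean rate in ipm, fc in ipm), copied from Table I
TABLE_I = {
    "i (and)": (35196.38, 32000),
    "imetj (to have)": (715.38, 640),
    "professor": (179.23, 40),
    "intelligentsia": (62.64, 18),
    "KGB": (28.87, 48),
    "Internet": (23.86, 30),
}


def fc_to_mean_ratio(mean_rate, f_c):
    """Ratio of the most probable rate to the mean rate of a word."""
    return f_c / mean_rate


ratios = {word: fc_to_mean_ratio(m, fc) for word, (m, fc) in TABLE_I.items()}
for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{word:20s} fc/mean = {ratio:.2f}")
```

The common words have a ratio somewhat below one, the words referring to people have a ratio far below one, and the two contagious words have a ratio above one, in line with the three cases described above.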

Our study of word rates is based on the reference
corpus of Russian, which includes more than 4 · 10^ words
in 1566 texts balanced in their coverage of various
genres: fiction, newspapers, and various informative texts,
all originally written in Russian from 1980 to 2002. Unlike
earlier corpora (of about 1 million words), reference
corpora of this size are close to saturation; namely, any
collection of new documents added to the corpus does not
cause statistically significant changes to the frequencies
and patterns of use of its words.

The corpus provides the list of the 5000 words
most frequently used in modern Russian. The study of
fc shows that for about 60% of them the most probable
frequency is fc < 50 ipm, and just 2.51% have fc >
600 ipm. Among the words typically having very high
most probable frequencies, one can mention conjunctions
and prepositions, pronouns and relational verbs, some
motion verbs, and size adjectives. Proper nouns,
acronyms, technical terms, and other semantically loaded
words typically have comparatively small values of fc.

Document counts in the n-th frequency interval
obviously decrease with n for n > fc/Δf. We have found
that they decay exponentially with n for rather small coarse
grains Δf < fc,

$$ N_w(n, \Delta f) \propto \exp\left(-\frac{n\,\Delta f - f_c}{\xi_w}\right), \qquad \Delta f \ll f_c, \tag{1} $$

where ξw plays the role of a "correlation length" of the
word w and diverges as the scale Δf approaches the
critical value fc as ξw ∝ |fc − Δf|^(−αw), with a positive index
αw which is close to unity for the majority of words (see
the data of Table I). The parameter ξw sets the
characteristic excess of the word frequency x over fc.
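
Given the document counts above fc, the correlation length in the exponential tail (1) can be estimated from the slope of the logarithm of the counts. The following Python sketch uses an ordinary least-squares fit; the fitting procedure, the function name, and the synthetic data are assumptions of this example, since the text does not say how the parameters were actually fitted.

```python
import math


def fit_correlation_length(counts, delta_f, f_c):
    """Estimate xi_w from the slope of ln N_w versus (n * delta_f - f_c).

    `counts` maps the interval index n to the number of documents in it;
    only non-empty intervals above fc enter the least-squares fit.
    """
    pts = [(n * delta_f - f_c, math.log(c))
           for n, c in counts.items() if c > 0 and n * delta_f > f_c]
    if len(pts) < 2:
        raise ValueError("need at least two non-empty tail intervals")
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    slope = (sum((x - mx) * (y - my) for x, y in pts)
             / sum((x - mx) ** 2 for x, _ in pts))
    if slope >= 0:
        raise ValueError("the tail does not decay")
    return -1.0 / slope


# Synthetic, noise-free tail generated with xi = 200 ipm (illustrative values).
synthetic = {n: 1000.0 * math.exp(-(n * 50 - 640) / 200.0) for n in range(13, 40)}
xi_w = fit_correlation_length(synthetic, delta_f=50, f_c=640)
print(round(xi_w, 3))
```

With real counts the fit would be noisy, and the range of n over which the tail is exponential would have to be chosen with care.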

We have observed that the values of ξw for words
belonging to the same semantic group (such as relational
verbs, motion verbs, perception verbs, some size
adjectives, pronouns, conjunctions, and prepositions) are very
close even if the values of the other empirical parameters
measuring the variability of their rates are rather diverse.

For larger scales Δf ≈ fc and for the frequency intervals
with n > fc/Δf, the distributions Nw are scale free,

$$ N_w(n, \Delta f) \propto (n\,\Delta f - f_c)^{-\beta_w}, \tag{2} $$

where the index βw > 1 (see the data of Table I). The
data of Table I show that the value of βw grows with
fc almost linearly (the coefficient of linear correlation
between fc and βw in Table I is 0.93). For the supercritical
phase Δf > fc, the tail of the probability distribution
forms a stretched exponential.

FIG. 1: The distribution of documents over the frequencies
(instances per million words) for the relational verb 'imetj' (to
have). The most probable frequency for this verb is fc = 640
instances per million words; the coarse grain is taken as Δf =
50. The distribution has an exponential tail. Statistics on
4 · 10^ words, 1566 texts written in Russian from 1980 to 2002.

FIG. 2: The distribution of documents over the frequencies
(instances per million words) for the relational verb 'imetj' (to
have) has a power-law decaying tail with the exponent β =
3.43 when the coarse grain Δf is taken close to fc = 640.
Statistics on 4 · 10^ words, 1566 texts written in Russian
from 1980 to 2002.
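
The linear correlation between fc and the power-law exponent can be recomputed from the two corresponding columns of Table I. The Python sketch below evaluates the Pearson coefficient over all rows of the table; the helper function is written out by hand, and the result may differ slightly from the value quoted in the text because the tabulated values are rounded.

```python
import math

# (fc in ipm, beta_w) pairs, one per row of Table I
FC_BETA = [
    (640, 3.433), (9390, 5.438), (900, 3.536), (128, 2.149),
    (540, 2.574), (220, 3.249), (1600, 5.006), (300, 2.828),
    (300, 3.016), (60, 2.043), (25200, 11.392), (32000, 11.648),
    (10400, 5.078), (3300, 3.006), (1800, 6.331), (24, 1.762),
    (300, 2.889), (750, 3.242), (40, 1.423), (18, 1.021),
    (48, 1.034), (30, 1.072),
]


def pearson(xs, ys):
    """Pearson coefficient of linear correlation between two samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson([fc for fc, _ in FC_BETA], [beta for _, beta in FC_BETA])
print(f"r = {r:.2f}")
```

The coefficient is dominated by the few very frequent words (the preposition, the conjunction, and the pronouns), so it says little about the relation between the two parameters among rare words.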
TABLE I: Empirical parameters measuring the variability of rates: the mean frequency f̄ (ipm), the number of documents in
which the word is used, the standard deviation of frequencies Sf, the coefficient of variation Sf/f̄, the most probable frequency
fc (ipm), and the power exponents αw and βw. Statistics on 4 · 10^ words, 1566 texts written in Russian from 1980 to 2002.

Group                  Lemma                       f̄ (ipm)   No. of texts  Sf        Sf/f̄   fc (ipm)  αw     βw
Relational verbs       imetj (to have)             715.38    1347          717.21    1.00   640       1.089  3.433
                       bytj (to be)                10635.78  1555          4481.04   0.43   9390      1.108  5.438
Motion verbs           idti (to go)                1029.18   1422          880.59    0.86   900       0.956  3.536
                       ehatj (to ride, to travel)  221.36    914           448.66    2.03   128       0.869  2.149
Perception verbs       smotretj (to look at)       817.28    1284          930.40    1.14   540       1.201  2.574
                       slishatj (to hear)          306.46    1081          370.51    1.21   220       1.361  3.249
Size adjectives        bolshoy (large)             1602.30   1487          908.81    0.57   1600      0.978  5.006
                       malenjkii (small)           386.17    1173          482.61    1.25   300       0.915  2.828
                       visokii (high)              307.33    1176          404.18    1.32   300       1.116  3.016
                       niskii (low)                73.01     602           172.55    2.36   60        0.824  2.043
Prepositions           v (in)                      28450.99  1566          8625.48   0.30   25200     1.346  11.392
Conjunctions           i (and)                     35196.38  1566          9620.76   0.27   32000     0.947  11.648
Pronouns               on (he)                     17804.82  1554          10537.47  0.59   10400     0.952  5.078
                       c\v\ n ( Qn p]              6651.45   1530          6118.09   0.92   3300      0.900  3.006
Abstract nouns         vremya (time)               1830.26   1489          1167.38   0.64   1800      1.148  6.331
                       spravedlivostj (justice)    41.91     368           158.17    3.77   24        0.591  1.762
References to objects  stol (table)                512.40    1147          629.38    1.23   300       1.222  2.889
                       dom (house)                 1030.96   1351          1088.64   1.06   750       0.867  3.242
References to people   professor                   179.23    502           939.73    5.24   40        0.636  1.423
                       intelligentsia              62.64     284           1334.78   21.31  18        0.973  1.021
Contagious words       KGB                         28.87     207           151.82    5.26   48        1.011  1.034
                       Internet                    23.86     133           161.61    6.77   30        0.990  1.072



Let us note that the asymptotic behaviors (1) and
(2) are typical for the subcritical phase and the critical
regime of percolation systems [9]. Here, Δf plays the
role of the order parameter, and fc is its critical value.
A percolation-like phase transition observed with respect
to a word in the collections of documents in which the
frequency x of this word exceeds its most probable rate
fc provides evidence for the existence of literary
genres. Content words are of particular interest since
their usage usually characterizes the content of a text, so that
a corpus of texts in which such words pile up can be
interpreted as a literary genre possessing a special
lexicon.

In this brief report we have studied the empirical
distributions of documents over the frequencies of Russian
words computed on a linguistic corpus of 1566 texts
(of 4 · 10^ words). The approach to word frequency
analysis which we have proposed is very general and can
be applied to any other human (or artificial) language.
We have introduced a new empirical parameter fc, the
most probable usage frequency of a word in a
contemporary language. The value of fc is independent of
authors, genres, and topics, but obviously varies in time as
the language evolves. The most probable frequency of
a word could be useful in studies devoted to the
evolution of languages. This parameter helps us to handle the
heterogeneous structure of word rate series and to
determine whether words belong to the core lexicon. Distributions
of documents over frequencies for the semantically loaded
words which are found in abundance in a few documents
have remarkably long tails. The typical excess of the
word rate over fc, which plays the role of the correlation
length in (1), can be used in the automatic recognition of
the grammatical functions of words in a language. We have
shown that collections of documents accumulating
content words exhibit a percolation-like phase transition,
giving rise to certain genres of literature with their own
specialized lexicons or terminologies.